Final Project STAT 331

Authors

Emma Turilli, John Ieng, Nat Sakamoto, Gabby Apsay

Reproducibility

All code, raw data, and project files are available in our GitHub Repository. Feel free to explore or replicate our analysis!

1 Project Proposal + Data

This analysis uses the life expectancy and gross domestic product (GDP) datasets from Gapminder, a non-profit organization whose mission “is to fight devastating ignorance with a fact-based world view everyone could understand.” Their site provides datasets collected from many reputable sources, along with interactive visualizations on important world topics.

1.1 Data Cleaning

In the raw GDP dataset, some values use a “k” suffix to represent thousands of dollars (e.g., 10,000 is written as 10k). The first step is therefore to convert the GDP values into numeric form. To keep values consistent, we created a function that expands these abbreviated values into their full numeric form, allowing for accurate numeric comparisons. Without this step, any value containing a “k” would fail to convert and be left empty, which could affect later analysis.
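The conversion described above can be sketched in base R as follows. This is a minimal illustration, not the function used in the project: the name `parse_k` is hypothetical, and the sketch assumes that “k” is the only suffix present and that it always means thousands.

```r
# Hypothetical helper: turn strings such as "10k" or "481" into numbers.
# Assumption: "k" is the only suffix and always means "times 1,000".
parse_k <- function(x) {
  x <- trimws(as.character(x))
  has_k <- grepl("k$", x, ignore.case = TRUE)                # which values end in "k"
  value <- as.numeric(sub("k$", "", x, ignore.case = TRUE))  # strip the suffix, then convert
  ifelse(has_k, value * 1000, value)                         # scale only the suffixed values
}

parse_k(c("481", "10k", "12.3k"))
```

Missing inputs stay `NA`, so genuinely absent GDP values are not confused with values that merely carried a suffix.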

1.2 Pivoting Longer

The life expectancy data contains the life expectancy, in years, for 196 countries from 1800 to 2100. For 1800 to 1970, the data was sourced from Gapminder’s main source, v7, compiled by Mattias Lindgren. Data for 1950-2019 comes from the Global Burden of Disease Study 2019 by the IHME. For 2020-2100, Gapminder used UN forecasts from the World Population Prospects 2022.

Life Expectancy Info from: https://www.gapminder.org/data/documentation/gd004

The GDP data was obtained from the Maddison Project Database (MPD) and the Penn World Table (PWT). This dataset contains gross domestic product (GDP) per person, adjusted for differences in purchasing power and expressed in international dollars at fixed 2017 prices. GDP per capita measures the value of everything a country produces during a year, divided by the number of people. We transformed the data to have columns containing the country, year, and GDP.

GDP Info from: https://www.gapminder.org/data/documentation/gd001/

We transformed the individual year columns into a single year column so that the dataset would be easier to work with. As a result, each observation consists of one country and year, with the corresponding life expectancy. The raw GDP data has the same layout as the life expectancy data, with one column per year, so we transformed it in the same way, making year its own column alongside the corresponding GDP.
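As an illustration of this reshaping, the sketch below uses `tidyr::pivot_longer()` on a toy table in the same wide layout. The object names `lex_wide` and `lex_long`, the country labels, and the values are all made up for the example; the real files have one column for every year from 1800 to 2100.

```r
library(tidyr)

# Toy wide table shaped like the raw files: one row per country,
# one column per year. Countries and values are illustrative only.
lex_wide <- data.frame(
  country = c("A", "B"),
  `1800` = c(30.1, 35.2),
  `1801` = c(30.3, 35.0),
  check.names = FALSE  # keep the year column names as "1800", "1801"
)

# Stack every year column into a (year, life_expectancy) pair.
lex_long <- pivot_longer(
  lex_wide,
  cols = -country,
  names_to = "year",
  values_to = "life_expectancy",
  names_transform = list(year = as.integer)  # store year as a number, not text
)

lex_long
```

The same call, with `values_to = "gdp"`, would apply to the GDP table once its values are numeric.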

1.3 Joining Datasets

country year life_expectancy gdp
Afghanistan 1800 28.2 481
Afghanistan 1801 28.2 481
Afghanistan 1802 28.2 481
Afghanistan 1803 28.2 481
Afghanistan 1804 28.2 481
Afghanistan 1805 28.2 481

After cleaning each dataset, we joined the two by country and year, so that each row holds the life expectancy and GDP of one country in one year. We hypothesize that as GDP increases, life expectancy will also increase, since a higher GDP tends to come with better infrastructure and broader, better access to healthcare and medicine.
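A join of this kind can be sketched with `dplyr`. The toy tables and their values are made up, and two choices here are assumptions rather than something the report states: an inner join (which keeps only country-year pairs present in both tables) and matching on both `country` and `year`, which is what produces one row per country and year as in the table above.

```r
library(dplyr)

# Toy long tables; values are illustrative only.
lex_long <- data.frame(
  country = c("A", "A", "B"),
  year = c(1800L, 1801L, 1800L),
  life_expectancy = c(30.1, 30.3, 35.2)
)
gdp_long <- data.frame(
  country = c("A", "A", "B"),
  year = c(1800L, 1801L, 1801L),
  gdp = c(500, 510, 900)
)

# Assumption: inner join on both keys. Country B has no matching
# year in the two tables, so it does not appear in the result.
gdp_lex <- inner_join(lex_long, gdp_long, by = c("country", "year"))

gdp_lex
```

Matching on `country` alone would pair every year of one table with every year of the other, so including `year` in `by` keeps the result at one row per country and year.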

2 Linear Regressions

2.1 Data Visualization

2.2 Linear Regression


Call:
lm(formula = life_expectancy ~ avg_gdp, data = gdp_lex)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.325 -19.320  -2.212  20.231  40.549 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.805e+01  1.310e-01  366.71   <2e-16 ***
avg_gdp     4.084e-04  7.205e-06   56.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.86 on 57042 degrees of freedom
Multiple R-squared:  0.05332,   Adjusted R-squared:  0.05331 
F-statistic:  3213 on 1 and 57042 DF,  p-value: < 2.2e-16

2.3 Model Fit

Variances
Response Fitted Values Residuals
459.83 24.52 435.31
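The variances in this table connect directly to the R-squared reported in the regression summary: for a least-squares fit with an intercept, the response variance splits into the fitted-value variance plus the residual variance, and R-squared is the fitted share. The sketch below checks this with the reported numbers, then confirms the identity on R's built-in `mtcars` data, which is used purely for illustration and is not part of this project.

```r
# Variances as reported in the table above.
var_response <- 459.83
var_fitted   <- 24.52
var_resid    <- 435.31

# Share of response variance explained by the model.
r_squared <- var_fitted / var_response
round(r_squared, 4)  # about 0.0533, in line with the reported Multiple R-squared

# The same identity on a built-in dataset (illustration only).
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(var(fitted(fit)) + var(resid(fit)), var(mtcars$mpg))
all.equal(var(fitted(fit)) / var(mtcars$mpg), summary(fit)$r.squared)
```

Read this way, average GDP accounts for roughly 5% of the variation in life expectancy in this model, with the remaining 95% or so left in the residuals.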